LAGSwin: Local attention guided Swin-transformer for thermal infrared sports object detection

Compared with visible light images, thermal infrared images have lower resolution, lower contrast and signal-to-noise ratio, blurred visual effects, and less information. Thermal infrared sports target detection methods relying on traditional convolutional networks capture the rich semantics in high-level features but blur the spatial details. The differences in physical information content and spatial distribution between high-level and low-level features are ignored, resulting in a mismatch between the region of interest and the target. To address these issues, we propose a local attention-guided Swin-transformer thermal infrared sports object detection method (LAGSwin) to encode the spatial transformation and orientation information of sports objects. On the one hand, a Swin-transformer guided by local attention is adopted to enrich the semantic knowledge of low-level features by embedding local focus from high-level features, generating high-quality anchors while increasing the embedding of contextual information. On the other hand, an active rotation filter is employed to encode orientation information, producing orientation-sensitive and orientation-invariant features that reduce the inconsistency between classification and localization regression. A bidirectional criss-cross fusion strategy is adopted in the feature fusion stage to enable better interaction and embedding of features at different resolutions. Finally, evaluation on multiple open-source sports target datasets shows that the proposed LAGSwin detection framework has good robustness and generalization ability.


Introduction
Target detection in thermal infrared images aims to identify the position and category of objects of interest, such as pedestrians, vehicles, and indoor sports players [1]. With the successful application of deep learning technology in many fields, thermal infrared target detection algorithms based on deep learning frameworks [2] have also made significant progress in recent years. Most existing thermal infrared detection algorithms focus on extracting high-level feature information of target objects in thermal infrared images, ignoring the physical appearance properties and low-level semantics of targets. Compared with remote sensing and natural images, thermal infrared images show significant semantic ambiguity between different categories due to low resolution and low contrast [3], which poses a serious challenge to thermal infrared target detection. Obtaining better semantic details from such low-resolution, low-contrast images has become a research hotspot.
To obtain better thermal infrared target detection performance and efficiency, existing advanced thermal infrared target detectors are mainly based on the two-stage RCNN framework and the lightweight single-stage YOLO series [4]. The two-stage RCNN [5,6] uses a region proposal network (RPN) [7] to generate high-quality regions of interest from horizontal anchors and then performs bounding box regression and classification. It is worth noting that horizontal anchors easily lead to a severe mismatch between the bounding box and oriented target objects. At the same time, sports target objects differ significantly in scale, shape, and color, and at certain moments during motion the targets may overlap or crowd together. To alleviate these problems, a Region of Interest (RoI) Rotator was recently proposed to convert horizontal anchors to rotated anchors, avoiding the redundant computations brought by many anchors. However, this RoI rotation operation is mainly used for target detection in remote sensing images and less for thermal infrared target detection tasks. Single-stage detection methods mainly offer high speed; compared with the two-stage RCNN framework, their accuracy still needs to be improved.
The main contributions of the LAGSwin detection framework proposed in this paper are as follows:
• A local attention-guided Swin-transformer is designed to form mutual embedding between high-level and low-level features. When the high-level features have insufficient representation ability, abundant low-level semantics are embedded in them; when low-level semantic information is weak, embedding abundant high-level semantics helps resolve the semantic ambiguity between different classes.
• A criss-cross fusion strategy is designed so that the target semantics in low-resolution feature maps gain a strong representation through cross-fusion. It describes the thermal infrared target at three levels (low, medium, and high) and establishes effective spatial relationships and long-term dependencies. At the same time, interaction between features at different levels is realized, so the spatial details of sports targets in thermal infrared images are better captured.
• An aligned convolutional filter is introduced in the detection stage to encode orientation information while reducing the inconsistency between classification and localization regression, enabling us to generate high-quality anchors and aligned features for accurate detection of sports objects in thermal infrared images.
Finally, evaluation is carried out on the open-source thermal infrared sports dataset and other RGB sports datasets, where the proposed LAGSwin detection framework achieves the best performance in both speed and accuracy. We also design a weighted loss function to tune and optimize the proposed framework.
The rest of this article is organized as follows. The second section reviews related work on thermal infrared target detection. The third section introduces the proposed LAGSwin detection framework and describes the functions and principles of its components in detail. The fourth section presents the experimental results and analysis of different detection methods. The conclusions and future research plans are described in the last section.

Related work
We describe the progress and current status of thermal infrared object detection from two perspectives: two-stage and one-stage detection methods.

Two-stage detection methods
With the successful application of deep learning technology in many fields, object detection has developed significantly in recent years. At present, standard thermal infrared target detection algorithms are mainly divided into two types, namely two-stage and single-stage detection methods. The two-stage detection method realizes feature extraction by generating a sparse set of ROIs in the first stage and performs boundary regression and object classification in the second stage. For example, Li et al. [8] designed a faster light-sensing two-stage RCNN detection model that accounts for the differences between optical and thermal infrared images, and discussed the feature extraction capabilities of various convolutional networks in depth. Aiming at reliable and efficient object detection in thermal infrared images, Dai X et al. [5] proposed a novel object detection method based on convolutional networks, which can be optimized and predicted in an end-to-end manner. Dai et al. [9] proposed a multi-task Faster RCNN detector that evaluates driving distance to improve driving safety, and improved the performance of thermal infrared object detection by adjusting the feature extractor. Song et al. [4] created segmentation templates for the heat-generating components in images from a thermal infrared camera and proposed a Mask RCNN-based infrared image detection algorithm. Although the detection accuracy of these two-stage thermal infrared target detection algorithms is good, their efficiency needs to be improved. At the same time, these methods often use a relatively simple convolution structure in the feature extraction stage and pay too much attention to the high-level semantics of targets in thermal infrared images, ignoring rich low-level semantic information.

One-stage detection methods
Compared with the two-stage target detector, the single-stage detector acts directly on the target and does not need an ROI generation stage, so it is more efficient but lags behind the two-stage detector in accuracy. Among single-stage detection algorithms, Jiang C et al. [3] used YOLO to capture the feature information of targets from thermal infrared images and proposed a UAV TIR target detection framework for thermal infrared photos and videos. Considering the poor performance of RGB images at night and in complex weather conditions, Krišto M et al. [6] used YOLOv3, a detection model designed for RGB images, to detect targets in thermal infrared images; its detection speed and accuracy were competitive. Based on YOLOv5, Li S et al. [10] proposed a region-free object detector, YOLO-FI, tailored to the characteristics of thermal infrared images: in the feature extraction stage, the cross-stage-partial-connections (CSP) module in the shallow layers is expanded and iterated, making maximal use of shallow features to improve their representation power. Li L et al. [11], considering that most ship detection algorithms rely on hand-crafted features to segment visible light image blocks and are limited in practice by factors such as illumination, clouds, and atmospheric waves, designed a complete YOLO-based ship detection method (CYSDM) for complex TIRSIs. For thermal infrared images, Hou Z et al. [12] designed a thermal infrared target detector, M-YOLO, that helps to integrate global context information; in the feature extraction stage, a parallel top-down and bottom-up feature fusion method is used to preserve and enhance the representation ability of the features to the greatest extent. To make up for the inability of traditional cameras to work under harsh lighting conditions, Li W et al. [13] designed a nighttime thermal infrared pedestrian detection algorithm based on YOLOv3. Xue Y et al. [14] used a compressed Darknet53 to obtain the feature information of two modalities and applied a weighted fusion strategy for feature fusion, proposing a thermal infrared pedestrian detection algorithm with multi-modal attention fusion. Although these thermal infrared detection methods have good detection efficiency, their detection accuracy needs to be improved. At the same time, they mainly use traditional convolution or simple weighted fusion strategies in the feature extraction stage, which easily introduce a large amount of redundant information during feature transfer. In addition, these methods mainly focus on detecting single-category targets in thermal infrared images and less on multi-category sports targets.
Sports objects may overlap, occlude, or shadow one another during motion. Therefore, capturing more physical appearance attributes, such as shape and size, in the feature extraction process is necessary to enhance high-level semantics. For example, Masuda T et al. [15] proposed a motion video behavior detection method based on self-supervised feature learning and target detection, which introduces target detection into the pipeline and realizes action detection for multiple people by tracking each person. Considering the high coupling between different features, Zhao J et al. [16] designed a non-global attention mechanism: a local U-shaped attention decoupling network. Jiang X et al. [17] proposed a new complementary transformer network (MCNet) for object detection in RGB and thermal infrared images. It introduces a transformer-based feature extraction module to efficiently extract hierarchical features of RGB and thermal images, together with an attention-based feature interaction module and a feature fusion module based on serial multiscale dilated convolution (SDC), realizing complementary interaction of low-level features and semantic fusion of deep features. Liu Z et al. [18] proposed SwinNet, a cross-modal fusion model for RGB and thermal infrared salient target detection. Driven by the Swin Transformer, the method extracts hierarchical features, bridges the gap between the two modalities with an attention mechanism, and sharpens salient object contours under the guidance of edge information.
Xu F et al. [19,20] considered that, due to problems such as color cast and blur in underwater images, the features extracted directly from the backbone network often lack informative and distinguishable content, which affects the performance of marine target detection. They proposed a novel ocean object detector based on an attention-based spatial pyramid pooling network and a bidirectional feature fusion strategy to alleviate feature weakening, and then a novel scale-aware feature pyramid structure, SA-FPN, to extract rich, robust features of underwater images and improve the performance of marine object detection. Wang H et al. [21,22] observed that autoencoder-based hashing algorithms aim at minimizing the reconstruction loss between input data and binary codes while ignoring the potential consistency and complementarity of multi-source data. They proposed an autoencoder-based multi-view binary clustering hashing algorithm that dynamically learns an associative graph with low-rank constraints and employs collaborative learning between the autoencoder and the associative graph to learn a unified binary code. Then, considering that most existing methods have to introduce additional clustering steps to produce the final clusters, which significantly weakens the unified relationship between graph learning and clustering, they proposed a multi-view clustering method based on graph collaboration (MCGC). Xu F et al. [23] considered that synthetic images are not realistic enough, which affects generalization to natural test images. They introduced segmentation masks to construct RGB-mask pairs as input, and designed an attention-guided style transfer network that learns style features from attention and background regions and content features from entire and attention regions.
These methods pay more attention to the low-level semantics of the target object during feature extraction, but the interaction between high-level and low-level features is insufficient. At the same time, it is challenging to balance high-level semantics and low-level features when establishing long-term dependencies. Therefore, we propose a local attention-guided Swin-transformer for thermal infrared sports object detection (LAGSwin) to address these limitations. In addition, the proposed framework can be practically applied to thermal infrared imaging fault diagnosis [24] and other thermal infrared image target detection [25] tasks.

Our proposed method
In this section, we first describe the overall architecture of the proposed LAGSwin detection framework. Secondly, we elaborate on the feature extraction modules, namely the local attention-guided Swin-transformer component and the criss-cross PAFPN fusion component. Finally, the detection module with aligned convolutional filters is introduced, and the weighted loss function proposed in this paper is elaborated.

Overview of the proposed LAGSwin
Contextual semantic information and spatial details are essential for target detection. Although the traditional global pooling operation can effectively aggregate this information, when the background semantics of the target are complex or the target scale varies widely, this operation may fail to fully capture contextual, local, and global spatial details because the target information is unclear. Therefore, we propose the Local Attention Guided Swin-transformer [26][27][28][29][30][31], which improves discriminative ability by embedding high-level features in low-level semantics and embedding some important low-level semantics in high-level features, making up for the lack of basic attribute information, such as physical appearance, in high-level semantics. This forms complementarity between high-level and low-level features and prompts the network to make better use of the target's spatial details. The criss-cross fusion module (CCFM), in turn, models targets along top-down and bottom-up paths to fully capture their contextual semantics, which enables the proposed framework to establish more effective long-term dependencies on targets. It is worth noting that the criss-cross fusion strategy not only achieves further interaction between features at different levels but also further refines the spatial details and reduces the use of redundant information. In the aligned convolution detection module (ACDM), an aligned convolution filter is introduced to ensure accurate encoding of orientation information and, simultaneously, to reduce the inconsistency between classification confidence and localization regression. In addition, we design a weighted loss function that acts on the classification branch to make the network converge better and achieve more accurate classification.

Local attention guided feature extraction module
This module is mainly composed of the Swin-transformer and the Local Attention Guidance Layer (LAG). The Swin-transformer is designed to explore the multi-scale local information of sports objects in thermal infrared images, while LAG creates interactions between high-level and low-level features at different scales so that dependencies exist between features at different levels.
Assuming that the input thermal infrared image is x ∈ R^{H×W×C}, where H, W, and C represent the height, width, and channel dimensions of the input, respectively, we first use a set of dilated convolutions [32,33] to perform initial feature extraction and obtain the initial feature f_0; secondly, the feature maps generated by the four stages of the Swin-transformer are f_1, f_2, f_3, and f_4, respectively. It is worth noting that, to preserve global semantic information as much as possible during initial feature extraction, we apply a weighted pooling operation after each group of dilated convolutions, namely the mean of global average pooling and maximum pooling, which prevents overfitting while maximally preserving global semantic details. The calculation of the feature map f_0 is shown in Eq.
Where r represents the dilation coefficient; DConv(·) represents the dilated convolution operation; Conv_{1×1}(·) represents the 1 × 1 convolution operation; Mean(·) represents the matrix mean operation; MaxPool(·) represents the maximum pooling operation; GAP(·) represents the global average pooling operation; and Cat(·) represents feature concatenation.
To obtain a better feature representation, the complementarity between high-level and low-level features should also be promoted. In addition, because different targets differ in spatial layout and physical appearance, it is not easy to model them uniformly. If low-level features are simply concatenated with high-level features, as is most common, a slight performance improvement can be obtained, but redundant information is also introduced. Therefore, to make full use of low-level features and, at the same time, establish effective associations between them, we design a local attention guidance layer (LAG) that preserves spatial details as much as possible while bridging the gap between high-level and low-level features. The Local Attention Guidance Layer (LAG) operation on f_0 to f_4 is shown in Eq.
Where f'_s represents the local attention guidance features at different scales; α represents the local attention coefficient; s represents the feature map scale; LAG represents the local attention guidance operation; f_s represents high-level features; and f_{s−1} represents low-level features.
According to the equation, the operation of LAG is structurally similar to a residual connection. When the expressive ability of low-level semantic information is already good, it avoids excessive interference from high-level features, thus further emphasizing the importance of low-level features. In other words, this guidance scheme is designed to support the representation of low-level semantic information, while the mutual embedding process enriches the expression of semantic information.
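The two ideas described in this subsection can be illustrated with a toy sketch. It is not the authors' implementation: the equations are not reproduced in this text, so the sketch assumes that the weighted pooling averages a global average pool with a global maximum, and that the guidance takes a residual-style form in which the high-level feature is scaled by the coefficient α and added to the low-level feature. The names `weighted_pool` and `lag_blend`, and the flat-list feature representation, are hypothetical.

```python
from typing import List

Feature = List[float]  # toy stand-in for one flattened feature map


def weighted_pool(feature: Feature) -> float:
    """Mean of global average pooling and global max pooling (assumed form)."""
    gap = sum(feature) / len(feature)  # global average pooling
    gmp = max(feature)                 # global max pooling
    return (gap + gmp) / 2.0


def lag_blend(low: Feature, high: Feature, alpha: float) -> Feature:
    """Residual-style guidance (assumed form): low + alpha * high.

    A small alpha keeps the low-level feature almost untouched, which mirrors
    the remark that strong low-level semantics should not be overwhelmed by
    high-level features.
    """
    if len(low) != len(high):
        raise ValueError("low and high features must have the same length")
    return [l + alpha * h for l, h in zip(low, high)]


if __name__ == "__main__":
    print(weighted_pool([1.0, 2.0, 3.0]))
    print(lag_blend([1.0, 2.0], [10.0, 20.0], alpha=0.5))
```

With `alpha=0.0` the low-level feature passes through unchanged, which is the residual-like behaviour the paragraph above describes.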

Criss-cross fusion module (CCFM)
In the feature aggregation stage, the Feature Pyramid Network (FPN) [34][35][36] usually obtains pyramid feature maps of different scales along the bottom-up path. In this way, shallow features must pass through multiple network layers before reaching the top layer, resulting in a serious loss of low-level feature information. Therefore, to preserve the rich low-level semantic information to the greatest extent, we design a dual-path criss-cross fusion strategy, which encodes feature maps of different scales from both bottom-up and top-down directions. The approach shortens the transmission path of low-level information flow between layers while ensuring the integrity and diversity of features. In addition, it further strengthens the interaction between high-level and low-level semantic information. The fusion steps of CCFM are as follows.
First, the feature maps f'_s obtained by the LAG module are passed to a 1 × 1 convolutional layer for feature compression. The bottom-up path is then used to obtain pyramid feature maps f_{P,s+1} of different scales (e.g., P2 to P5 in CCFM in Fig 1). These features can be expressed as follows.
Secondly, the top-down path is used to obtain pyramid features of different scales, and the criss-cross fusion strategy is applied to get f_{T,2}, f_{T,3}, f_{T,4}, and f_{T,5}. The fusion process is shown in the equation.
Where Conv_{3×3}(·) represents the 3 × 3 convolution operation, the stitching operator denotes feature concatenation, and Conv_{1×1}(·) represents the 1 × 1 convolution operation.
Although this dual-path criss-cross fusion strategy effectively captures multi-scale spatial details, it also introduces some redundant information. Therefore, we again use the LAG component to embed the four scales in each other, further improving the representation of contextual semantics and global spatial details. The embedding process is shown in Eq.
Where α_{T,s−1} and α_{T,s} represent the local attention maps. It is worth noting that we fuse the features of these four scales into two feature maps, one of high resolution and one of low resolution, namely the high-level and low-level feature information at s = {2, 4}.
In addition, we use simple feature concatenation to join these two feature maps and obtain the feature map that is finally used to represent the target object. The fusion is shown in the equation.
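The data flow of the dual-path fusion can be sketched on toy one-dimensional "feature maps". This is only an illustration of a bottom-up pass followed by a top-down pass, not the authors' CCFM: the equations are not reproduced in this text, so pair averaging stands in for strided down-sampling, element repetition for up-sampling, and an element-wise mean for the concatenation-plus-convolution fusion. All function names are hypothetical.

```python
from typing import List

Feature = List[float]  # toy 1-D stand-in for a feature map


def downsample(x: Feature) -> Feature:
    """Halve the resolution by averaging neighbouring pairs."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def upsample(x: Feature) -> Feature:
    """Double the resolution by repeating every element."""
    return [v for v in x for _ in range(2)]


def fuse(a: Feature, b: Feature) -> Feature:
    """Stand-in for concatenation followed by convolution: element-wise mean."""
    if len(a) != len(b):
        raise ValueError("features must have the same resolution")
    return [(u + v) / 2.0 for u, v in zip(a, b)]


def dual_path_fusion(levels: List[Feature]) -> List[Feature]:
    """Fuse pyramid levels ordered from finest to coarsest resolution."""
    # Bottom-up pass: carry fine detail towards the coarse levels.
    bottom_up = [levels[0]]
    for feat in levels[1:]:
        bottom_up.append(fuse(downsample(bottom_up[-1]), feat))
    # Top-down pass: carry coarse semantics back to the fine levels.
    top_down = [bottom_up[-1]]
    for feat in reversed(bottom_up[:-1]):
        top_down.append(fuse(upsample(top_down[-1]), feat))
    top_down.reverse()
    return top_down


if __name__ == "__main__":
    pyramid = [[0.0] * 8, [0.0] * 4, [0.0] * 2, [8.0]]
    for level in dual_path_fusion(pyramid):
        print(level)
```

In the usage example only the coarsest level carries a signal, yet after the two passes every level has a non-zero response, which is the cross-level interaction the fusion steps aim for.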

Aligned convolution detection module (ACDM)
This module includes aligned convolution filters, an anchor refinement structure, and classification and regression modules. The aligned convolution filter is applied to the features obtained by the feature extraction module for decoding. The anchor refinement structure generates more accurate, higher-quality anchor boxes to improve classification accuracy. Simply put, the primary purpose of these two components is to encode orientation information while reducing the inconsistency between classification and localization regression, enabling us to generate high-quality anchors and aligned features for accurately detecting sports objects in thermal infrared images.
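One common way to align a convolution with an oriented anchor is to place the kernel's sampling points on a grid that is scaled to the anchor size and rotated by the anchor angle. The sketch below illustrates that idea only; the text does not specify the sampling rule used in ACDM, so the anchor parameterization (cx, cy, w, h, θ), the kernel size of 3, and the function name `aligned_sampling_points` are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def aligned_sampling_points(
    cx: float, cy: float, w: float, h: float, theta: float, k: int = 3
) -> List[Point]:
    """Return k*k sampling locations on a grid fitted to an oriented anchor.

    (cx, cy) is the anchor centre, (w, h) its size, theta its rotation in
    radians. Points are listed row by row.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half = (k - 1) / 2.0
    points: List[Point] = []
    for row in range(k):
        for col in range(k):
            # Regular kernel offset, scaled so the grid spans the anchor.
            dx = (col - half) * w / k
            dy = (row - half) * h / k
            # Rotate the offset by theta and translate to the anchor centre.
            x = cx + dx * cos_t - dy * sin_t
            y = cy + dx * sin_t + dy * cos_t
            points.append((x, y))
    return points


if __name__ == "__main__":
    for p in aligned_sampling_points(0.0, 0.0, 3.0, 3.0, math.pi / 2):
        print(round(p[0], 3), round(p[1], 3))
```

With θ = 0 the points reduce to an ordinary axis-aligned grid, so a horizontal anchor behaves like a standard convolution with a size-dependent stride.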
To achieve the optimal performance of the proposed framework, we design a weighted loss function; the loss function is defined as the following equation.

Experimental results and analysis
To demonstrate the validity and reliability of the proposed LAGSwin detection framework, we use two open-source sports benchmark datasets, together with the FLIRs dataset, as experimental samples to evaluate the proposed method and current advanced detection models. First, the data sources are introduced in detail, and the evaluation metrics and parameters are given; second, the ablation studies and analysis are presented.

Data preparation
TTsports [16]. This dataset was captured by a Q1922-type thermal camera and contains four 30-second indoor football sequences. For consistency of the experiments, we rearranged the dataset and generated a total of 1500 images of size 1920 × 480 after processing; each thermal infrared image contains eight different sports players.
FLIRs [37]. This dataset was released in July 2018 and contains a total of 14,000 images, 10,000 of which come from short video clips, while another 4,000 BONUS images come from a 140-second video. These thermal infrared images mainly include four categories: people, cars, dogs, and other vehicles.
RGBsports [16]. This dataset contains 3000 RGB images, including 1874 football images and 1126 cricket images. For fairness of the experiments, we randomly selected 40% of them as training samples and 10% as validation samples, and the remaining 50% were used to evaluate all detection models. It is worth noting that, during dataset processing, we crop images with overlap to expand the dataset. At the same time, we rescale the images with scaling ratios of 0.5, 1.0, and 1.5.
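The 40% / 10% / 50% split and the three scaling ratios can be reproduced with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the fixed seed, the function names `split_indices` and `scaled_sizes`, and the use of image indices instead of file paths are illustrative choices, not details taken from the paper.

```python
import random
from typing import List, Sequence, Tuple

SCALES = (0.5, 1.0, 1.5)  # scaling ratios stated in the text


def split_indices(
    n: int, train_frac: float = 0.4, val_frac: float = 0.1, seed: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Randomly split n sample indices into train, validation, and test sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for repeatability (assumption)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # the remainder is used for evaluation
    return train, val, test


def scaled_sizes(
    width: int, height: int, scales: Sequence[float] = SCALES
) -> List[Tuple[int, int]]:
    """Image sizes obtained by applying each scaling ratio."""
    return [(round(width * s), round(height * s)) for s in scales]


if __name__ == "__main__":
    train, val, test = split_indices(3000)
    print(len(train), len(val), len(test))
    print(scaled_sizes(1920, 480))
```

For the 3000 RGBsports images this yields 1200 training, 300 validation, and 1500 test samples.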

Parameter settings and evaluation metrics
We adopt AdamW as the optimizer to tune and optimize the whole detection framework, where the learning rate is set to 0.0025, the number of iterations is set to 36, and the batch size is set to 16. Meanwhile, weights pre-trained on ImageNet22k are used to initialize the Swin-transformer module and are then fine-tuned to obtain a better feature representation. In the training process, we adopted two training strategies, single-scale and multi-scale, in which the multi-scale sizes were set to 600 × 480 and 600 × 800, and the scale in the test phase was 600 × 600.
All experiments were run on four RTX 3090 GPUs with Python 3.7.6 and torch 1.7.0+cu110, and the recall rate (R) and mean average precision (mAP) were used as evaluation metrics. The calculation process is shown in the equation.
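The equation itself does not appear in this text. For reference, the standard definitions of these metrics are given below; they are stated as an assumption about the form the authors used, and the symbol N (the number of categories) is introduced here for illustration.

```latex
\begin{aligned}
R &= \frac{TP}{TP + FN}, \qquad P = \frac{TP}{TP + FP}, \\
AP &= \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i .
\end{aligned}
```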
Where TP (True Positive) is the number of detection boxes whose intersection over union (IoU) with a Ground Truth box is greater than 0.5; it is worth noting that the same Ground Truth is counted only once. FP (False Positive) is the number of detection boxes whose IoU with the target box is ≤ 0.5, plus the number of redundant detection boxes assigned to an already detected Ground Truth. FN (False Negative) is the number of target boxes that are not detected. mAP (mean Average Precision) is the average of the AP values over all categories, where AP is the area under the P-R curve of a specific category. The larger the mAP value, the better the detection performance of the method.
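The counting rules above can be made concrete with a short sketch. It assumes axis-aligned boxes in (x1, y1, x2, y2) form and a simple greedy matching in which detections are processed in descending confidence order; the paper's evaluation code is not shown in this text, so the function names and the matching order are illustrative assumptions.

```python
from typing import Sequence, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(
    detections: Sequence[Box], ground_truth: Sequence[Box], thr: float = 0.5
) -> Tuple[int, int, int]:
    """Return (TP, FP, FN); detections are assumed sorted by confidence."""
    matched: Set[int] = set()
    tp = fp = 0
    for det in detections:
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue  # each Ground Truth box is counted only once
            value = iou(det, gt)
            if value > best_iou:
                best_iou, best_idx = value, idx
        if best_iou > thr:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1  # IoU <= thr, or a redundant box for a matched target
    fn = len(ground_truth) - len(matched)
    return tp, fp, fn


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0


if __name__ == "__main__":
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    dets = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 60, 60)]
    tp, fp, fn = count_matches(dets, gts)
    print(tp, fp, fn, recall(tp, fn), precision(tp, fp))
```

AP would then be obtained by sweeping the confidence threshold, recording precision and recall at each step, and integrating the resulting P-R curve; that step is omitted here for brevity.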

Comparison with advanced methods
To demonstrate the effectiveness of the proposed LAGSwin detection framework, we conduct experimental comparisons with multiple detection models; Table 1 presents the experimental results of the different methods. From Table 1, we can draw the following conclusions: (1) The proposed LAGSwin thermal infrared sports target detection framework achieves the best detection performance on three open-source datasets: TTsports, RGBsports, and FLIRs. For example, its mAP on the TTsports, RGBsports, and FLIRs datasets is 0.057, 0.017, and 0.025 higher than that of the ReDet method, respectively. The possible reason is that, on the one hand, we describe the targets in thermal infrared images in detail at three scales and levels (low, medium, and high), establish effective spatial relationships and long-term dependencies between these features, and use the local attention guidance layer to highlight details, allowing the network to better focus on the subtle changes between different types of objects as well as the differences between objects and backgrounds. On the other hand, introducing aligned convolutional filters in the detector, which encode orientation information while reducing the inconsistency between classification and localization regression, enables us to generate high-quality anchors and aligned features.
In addition, each component assists the network in obtaining the optimal feature representation, which ultimately leads to the optimal performance of the proposed model on the three datasets.
(2) Compared with single-stage detection methods such as SASM, Oriented RepPoints, and KLD, two-stage detection methods such as ReDet, Roi-transformer, and Oriented R-CNN are strongly competitive on these open-source datasets. For example, the mAP of the Roi-transformer is 0.079, 0.039, and 0.039 higher than that of the Oriented RepPoints detection method, respectively, and the mAP of Oriented R-CNN is 0.06, 0.068, and 0.055 higher than that of the KLD detection method. This may be because the two-stage detection method generates high-quality regions of interest in the first stage, which prompts the network to learn a better feature representation. In addition, the SASM detection method performed the worst on the three open-source datasets. The possible reason is that SASM focuses on the representation of object shape information while ignoring the extraction of high-level discriminative semantics. It is worth noting that the single-stage detection method Oriented RepPoints is more competitive on the RGBsports and FLIRs datasets; for example, its mAP is 0.026 and 0.036 higher than that of Oriented R-CNN, respectively. A possible explanation is that Oriented RepPoints uses adaptive point representation together with dynamic evaluation and allocation strategies, which helps the network capture the geometric information of arbitrarily oriented instances effectively; it uses three orientation conversion functions to position the target accurately and, at the same time, filters the feature point set to highlight representative points, which improves classification accuracy and finally leads to Oriented RepPoints outperforming Oriented R-CNN. In addition, the current advanced single-stage detection algorithm Rtmdet and the two-stage detection algorithm DiffusionDet are competitive on the three datasets; for example, their mAP values on the FLIRs dataset are 0.815 and 0.799, respectively. The detection efficiency of Rtmdet is better than that of DiffusionDet.
(3) The proposed LAGSwin detection framework remains highly competitive in inference efficiency while ensuring optimal detection accuracy. For example, the FLOPs of LAGSwin are 1.8, 2.4, and 1.3 lower than those of single-stage detection methods such as Oriented RepPoints, SASM, and KLD, respectively, while its detection accuracy is far superior to these methods.
Compared with the two-stage detection methods, the proposed LAGSwin detection framework achieves the best detection accuracy and inference efficiency (FLOPs).

Ablation studies
Components of the LAGSwin framework. We use quantitative and qualitative methods to verify whether each component in the proposed LAGSwin detection framework plays a positive role in the model. Table 2 presents the experimental results in detail. From Table 2, we draw the following conclusions: (1) In the proposed LAGSwin thermal infrared sports object detection framework, each component plays a crucial positive role in the overall performance. On the three open-source datasets, using PVTV2 as the backbone network is highly competitive; for example, its mAP is 0.01, 0.01, and 0.001 higher than that of Resnet101+DCN, respectively. Compared with Res2net101, its mAP is 0.002 and 0.004 higher on the TTsports and FLIRs datasets, respectively, but 0.006 lower on the RGBsports dataset. The main reason is that PVTV2 benefits from the self-attention mechanism in the Transformer and always maintains a global receptive field, ensuring the local semantics of the target while better acquiring its global details. The decrease on the RGBsports dataset may be because the target scale changes significantly or the target background is more complex, which reduces the detection performance of PVTV2. In addition, HRNet obtained the worst detection performance, which was 0.007, 0.009, and 0.002 lower than MobileNetV2 on the three open-source datasets. The possible reason is that HRNet focuses on obtaining high-resolution feature information of the target and ignores the rich low-resolution, hierarchical semantic information. This also shows that low-level semantic information helps detect thermal infrared sports targets.
(2) Compared with shallow feature extraction networks, deep backbone networks better capture the features of thermal infrared sports targets. For example, on the FLIRs dataset, the mAP of Res2net101, Resnet101, and Resnest101 is 0.012, 0.007, and 0.003 higher than that of Res2net50, Resnet50, and Resnest50, respectively. Similarly, on the RGBsports and FLIRs datasets, the mAP of Resnet101+DCN is 0.01 and 0.008 higher than that of Resnet50+DCN, respectively. A possible reason is that, as the network deepens, the backbone acquires more distinguishable high-level features, highlighting the differences between different types of targets. Compared with the Resnet and Resnest backbone networks, Res2net is highly competitive overall. For example, on the TTsports dataset, the mAP of Res2net50 is 0.001 and 0.004 higher than that of Resnet50 and Resnest50, respectively, although its R value is comparatively poor. In addition, as the network deepens, the detection accuracy improves on all datasets. On the RGBsports dataset, however, the mAP of Resnet101 and Resnest101 is equal.
(3) Replacing the Swin-transformer in our model with Res2net101 as the backbone network is highly competitive. On the RGBsports dataset, its mAP is 0.004 higher than that of the Swin-transformer. This may be because, in the local feature capture stage, Res2net101 first divides the target features in the RGB image into multiple subspaces so that the network can obtain more detailed local semantics. However, its R value is low, possibly because the deeper network uses a large amount of redundant information, which weakens the representation of the global semantics of the target. The model may also have fallen into a local optimum, and overfitting may have occurred.
(4) IFEM+Backbone+CCFM+ACDM showed the worst performance on the TTsports, RGBsports, and FLIRs datasets. A possible reason is that features at different scales differ significantly in spatial distribution and physical meaning, and a simple fusion method applied directly has difficulty balancing these differences. The LAG layer we designed, which uses a weight distribution and residual strategy, therefore plays an essential role. IFEM+Backbone+LAG+PAFPN+ACDM gained a better competitive advantage than IFEM+Backbone+LAG+FPN+ACDM. This may be because PAFPN uses its own top-down and bottom-up two-way information transmission strategy, which reduces the loss of details caused by information transmission while preserving the global semantics to the greatest extent. In addition, the IFEM+Backbone+LAG+ACDM method also performed poorly on the three datasets, with mAP of 0.808, 0.802, and 0.805, respectively, which shows that CCFM plays a positive role in the overall framework and benefits the contextual detail features. The Backbone+LAG+CCFM+ACDM method achieved better competitiveness. At the same time, it also shows that IFEM benefits the representation of prior knowledge.
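The weight distribution and residual strategy attributed to the LAG layer can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the nearest-neighbour upsampling, the equal channel counts, and the element-wise gating are all assumptions made for illustration. The sketch only shows the general idea of using a sigmoid of a high-level feature map to reweight a low-level feature map while a residual connection preserves the original low-level detail.

```python
import numpy as np


def sigmoid(x):
    """Logistic activation, mapping values to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)


def attention_guided_fusion(low, high):
    """Hypothetical gated residual fusion (illustrative only).

    low:  low-level feature map, shape (C, H, W)
    high: high-level feature map, shape (C, H / k, W / k), same C (assumed)
    """
    factor = low.shape[1] // high.shape[1]
    high_up = upsample_nearest(high, factor)  # match spatial resolution
    weights = sigmoid(high_up)                # attention weights from high level
    return low + weights * low                # residual keeps low-level detail


# Usage: a 2-channel 4x4 low-level map guided by a 2x2 high-level map.
low = np.ones((2, 4, 4))
high = np.zeros((2, 2, 2))
fused = attention_guided_fusion(low, high)
```

Because the gate only rescales the low-level map and the residual term passes it through unchanged, the fused feature keeps the spatial resolution of the low-level input.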

Loss function.
To demonstrate that our proposed weighted loss function has a positive effect on the overall performance of the model, different loss functions are tested on the TTsports, RGBsports, and FLIRs datasets. The experimental results are shown in Table 3.
From Table 3, we can see that when z_FL is used to optimize the proposed detection framework, the detection performance is highly competitive on the three open-source datasets (TTsports, RGBsports, and FLIRs). For example, the mAP of z_FL is 0.002, 0.002, and 0.003 higher than that of z_MCE + z_FL, respectively. A possible reason is that the target categories in these datasets are imbalanced; the z_FL loss function handles class imbalance effectively and thus obtains the best detection performance. Compared with the other two open-source datasets, the z_MCE loss function performs worst on RGBsports. This may be because the target categories in this dataset are small in size, the scale change between different categories is small, and the class imbalance problem further degrades the final detection performance.
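The difference between the cross-entropy and Focal loss terms compared above can be made concrete with a short NumPy sketch. It uses the standard multi-class Focal loss form, -(1 - p_t)^g log(p_t), and a simple weighted sum of the two terms. The exact weighted loss of LAGSwin is not given in this section, so the function names, the weights `w_mce` and `w_fl`, and the focusing parameter `focusing` are assumptions for illustration only.

```python
import numpy as np


def true_class_prob(logits, labels):
    """Softmax probability assigned to the true class of each sample."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs[np.arange(len(labels)), labels]


def cross_entropy(logits, labels):
    """Mean multi-class cross-entropy."""
    return float(np.mean(-np.log(true_class_prob(logits, labels))))


def focal_loss(logits, labels, focusing=2.0):
    """Mean Focal loss; well-classified samples are down-weighted."""
    p_t = true_class_prob(logits, labels)
    return float(np.mean(-((1.0 - p_t) ** focusing) * np.log(p_t)))


def weighted_loss(logits, labels, w_mce=0.5, w_fl=0.5, focusing=2.0):
    """Hypothetical weighted sum of the two terms (weights are assumed)."""
    return (w_mce * cross_entropy(logits, labels)
            + w_fl * focal_loss(logits, labels, focusing))


# Usage: three samples, two classes.
logits = np.array([[2.0, 0.0], [0.5, 1.5], [0.1, 0.0]])
labels = np.array([0, 1, 1])
ce = cross_entropy(logits, labels)
fl = focal_loss(logits, labels)
```

With `focusing=0` the Focal loss reduces to cross-entropy; larger values shrink the contribution of easy samples, which is why it is commonly used when classes are imbalanced.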

Discussion
To visually demonstrate the effectiveness of the proposed LAGSwin detection framework, Fig 2 and Table 4 show the detection performance of the model for each category on the TTsports, RGBsports, and FLIRs datasets.
According to Table 4, we find the following:
(1) In the RGBsports dataset, the detection performance for football is significantly better than that for baseball (class ID 0): AP and R are higher by 0.184 and 0.116, respectively. A possible reason is that the shape and size of the football are easy to distinguish from the image background, which enables the network to learn a more useful difference between the target and the background and thereby improves the detection performance for football. The baseball, being smaller, is easily confused with the background, which causes misclassification. In the FLIRs dataset, the otherscars class (class ID 3) achieved the worst performance, probably because this class has few samples and differs only slightly from the cars class, which causes misclassification. This also reduces the detection accuracy for the cars class. For example, the APs of dog and people are 0.053 and 0.069 higher than that of cars, respectively.
(2) In the TTsports dataset, the detection accuracy of each class is very competitive. It is worth noting that although sportsman6 and sportsman3 have the same AP of 0.838, their R values are 0.919 and 0.925, respectively. A possible reason is that sportsman6 is occluded during movement, which reduces its R value. A visual presentation of the different datasets is shown in Fig 2.
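The observation that two classes can share the same AP while differing in R follows from how the two metrics are computed. The sketch below is a minimal NumPy illustration, assuming a Pascal VOC style all-point interpolated AP and assuming that each detection has already been matched to ground truth (the `is_true_positive` flags). The evaluation protocol actually used for Table 4 is not specified in this section, and the function name is hypothetical.

```python
import numpy as np


def average_precision(scores, is_true_positive, num_gt):
    """Return (AP, final recall) for one class.

    scores:           confidence of each detection
    is_true_positive: 1 if the detection matches a ground-truth box, else 0
                      (the matching step is assumed to be done beforehand)
    num_gt:           number of ground-truth objects of this class
    """
    if len(scores) == 0:
        return 0.0, 0.0
    order = np.argsort(-np.asarray(scores, dtype=float))  # high score first
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)

    # Pad the curve and make precision non-increasing from right to left.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])

    # Sum the area under the curve wherever recall changes.
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    ap = float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
    return ap, float(recall[-1])


# Usage: three detections, two ground-truth objects.
ap, r = average_precision([0.9, 0.8, 0.7], [1, 0, 1], num_gt=2)
```

AP integrates precision over the whole recall range, whereas R only reports the fraction of ground-truth objects that were found, so a class with slightly lower recall can still reach the same AP if its detections are ranked more cleanly. The mAP values reported in the tables would then be the mean of the per-class AP values.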

Conclusions and future research
In this paper, we propose a local attention-guided Swin-transformer detection framework (LAGSwin) for detecting spatial details of moving objects in thermal infrared images. The method first uses a feature extraction module guided by local attention to strengthen the interaction between low-level and high-level features. Embedding low-level features into high-level features gives high-level features rich low-level semantics, and embedding high-level features into low-level features makes low-level features more discriminative in high-level semantics. Secondly, a cross-fusion strategy is designed to aggregate feature information from different directions, reducing redundant information while retaining spatial details to the greatest extent and ensuring the integrity and diversity of attribute information such as the physical appearance of the target. In the detection module, feature alignment alleviates the inconsistency between regression and classification. Finally, evaluation tests were performed on three open-source benchmark datasets (TTsports, FLIRs, and RGBsports), achieving the best detection performance and good robustness.
During the experiments, we found that the design of the feature extraction module is relatively complicated, which increases the redundancy of the model. The detection efficiency also has considerable room for improvement. Therefore, in future research, we will address these two aspects by designing a simple and efficient semantic guidance network: a lighter feature extractor that preserves the spatial details of the target in thermal infrared images to the greatest extent, and a more effective semantic fusion module that efficiently gathers different levels of semantics to improve feature representation performance.

Fig 1. (a) The overall network structure of LAGSwin; (b) the network structure of the LAG module. ACDM represents the detection module of aligned convolution filters; CCFM represents the cross-fusion strategy; LAG represents the local attention guidance module; Stages 1 to 4 represent the four stages of the Swin-transformer; P2 to P5 represent the top-down pyramid structure features; T2 to T5 represent the bottom-up pyramid structure features; the two operator symbols in the figure represent feature fusion and matrix multiplication, respectively; sigmoid(·) represents the activation function. https://doi.org/10.1371/journal.pone.0297068.g001

Fig 2. The detection results of our model on different datasets. (a) TTsports dataset; (b) FLIRs dataset. Detection boxes of different colors in the image represent different object classes. https://doi.org/10.1371/journal.pone.0297068.g002

Where z^h_MCE represents the loss function of high-level features; z^L_MCE represents the loss function of low-level features; z_FL represents the primary loss function of the proposed LAGSwin detection framework, which can effectively alleviate the problem of class imbalance; and β and γ represent weight factors.

Algorithm 1: The thermal infrared sports object detection process of the proposed LAGSwin.

Table 1. Experimental results of different detection methods. "FLOPs" indicates the number of floating-point operations, and the unit is GM; "Parameter" indicates the number of parameters, and the unit is ×10M. https://doi.org/10.1371/journal.pone.0297068.t001

Table 2. Experimental results of different components. IFEM+Backbone+CCFM+ACDM indicates that the model does not use any local attention guidance layer; IFEM+Backbone+LAG+FPN+ACDM indicates that FPN is used to replace the CCFM component; IFEM+Backbone+LAG+PAFPN+ACDM indicates that PAFPN is used to replace the CCFM component. 'Backbone' indicates the Swin-transformer feature extractor. https://doi.org/10.1371/journal.pone.0297068.t002

Table 3. Experimental results of different loss functions. z_MCE indicates that only cross-entropy loss is used; z_FL indicates that only Focal loss is used to adjust and optimize the entire network; z_MCE + z_FL indicates that a simple weighted loss is used, that is, cross-entropy and Focal loss act on the network together.

Table 4. Experimental results for each category on the TTsports, RGBsports, and FLIRs datasets. "Classes" indicates the object class in each dataset. "AP" represents the average detection accuracy of the objects. "ClassID" indicates the ID number of the category in the dataset. https://doi.org/10.1371/journal.pone.0297068.t004